1. Introduction

For classification problems, feature extraction is a crucial process
which aims to find a suitable data representation that increases the
performance of the machine learning algorithm. According to the curse
of dimensionality [4], the number of samples needed for a
classification task increases exponentially as the number of
dimensions (variables, features) increases. On the other hand, it is
costly to collect, store and process data. Moreover, irrelevant and
redundant features might hinder classifier performance. In exploratory
analysis settings, high dimensionality prevents the users from
exploring the data visually. Feature extraction is a two-step process:
feature construction and feature selection. Feature construction
creates new features based on the original features, and feature
selection is the process of selecting the best features, as in filter,
wrapper and embedded methods [5].

In this work, we focus on feature construction methods that aim to
decrease data dimensionality for visualization tasks. Various linear
(such as principal components analysis (PCA), multiple discriminants
analysis (MDA), exploratory projection pursuit) and non-linear (such as
multidimensional scaling (MDS), manifold learning, kernel PCA/LDA,
evolutionary constructive induction) techniques have been proposed for
dimensionality reduction. Our algorithm is an adaptive feature
extraction method which consists of evolutionary constructive induction
for feature construction and a hybrid filter/wrapper method for feature
selection.

2. The Multi-Objective Genetic Programming Projection Pursuit (MOG3P) Algorithm

We cast the dimensionality reduction task within the genetic
programming (GP) framework, where the goal is to simultaneously evolve
2 (or 3) data transformation functions that map the input dataset into
a lower dimensional representation for visualization. Each function is
represented as an expression tree which is made up of a number of base
functions over the initial features and represents a 1D projection of
the data.
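As a minimal sketch of this representation (not the authors' implementation), an individual can be held as a list of expression trees, each evaluated per sample to give one coordinate of the projection. The tuple encoding, the names `evaluate`, `project` and `tree_size`, and the protected-division fallback value of 1.0 are assumptions made purely for illustration.

```python
import operator

def protected_div(a, b):
    # Assumed convention: return 1.0 when the denominator is (near) zero.
    return a / b if abs(b) > 1e-12 else 1.0

# A subset of basis functions, all binary for simplicity.
BASIS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": protected_div,
    "min": min,
    "max": max,
}

def evaluate(tree, row):
    """Evaluate one expression tree on one sample.

    A tree is ("x", i) for feature i, ("const", c) for a constant,
    or (op, left, right) for a basis function applied to two subtrees.
    """
    kind = tree[0]
    if kind == "x":
        return row[tree[1]]
    if kind == "const":
        return tree[1]
    return BASIS[kind](evaluate(tree[1], row), evaluate(tree[2], row))

def project(individual, data):
    """Map each sample to one coordinate per expression tree."""
    return [[evaluate(tree, row) for tree in individual] for row in data]

def tree_size(tree):
    """Count the nodes of a tree (a possible expression-size measure)."""
    if tree[0] in ("x", "const"):
        return 1
    return 1 + tree_size(tree[1]) + tree_size(tree[2])

# An individual with T = 2 expressions over three original features.
individual = [
    ("+", ("x", 0), ("*", ("x", 1), ("x", 2))),  # x0 + x1 * x2
    ("/", ("x", 0), ("x", 1)),                   # x0 / x1 (protected)
]
data = [[1.0, 2.0, 3.0], [4.0, 0.0, 5.0]]
points_2d = project(individual, data)
```

Each tree yields one axis of the 2D view, so the evolved features remain explicit formulas over the original ones.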
The measure of interestingness that guides this search consists of
three equally important objectives (Algorithm 1). These objectives are:
1) classifiability: the generated data representation should increase
the performance of the learning algorithm(s), 2) visual
interpretability: clear class separability when visualized, 3) semantic
interpretability: the relationships between the original and evolved
features should be easy to comprehend (Figure 1).

Algorithm 1: The MOG3P Algorithm
Input : NxO data matrix D
        Set of basis functions B = {b_1, b_2, ..., b_K}
        Population size P
        Fitness function F
        Classifiability criterion C
        Visualization criterion V
        Semantic criterion S
        Number of transformation expressions in each
        individual (2 or 3 for visualization) T
Output: TxO transformed data matrix M, data
        transformation I

Randomly create initial population of data transformations
foreach transformation I in population do
    compute M by applying transformation I to D
    compute F(M | D, C, V, S) as the fitness of the
    transformation I
end
repeat
    Select transformations from the population as parents
    Perform breeding (crossover, reproduction, mutation
    operations)
    foreach transformation I in population do
        compute M by applying transformation I to D
        compute F(M | D, C, V, S) as the fitness of the
        transformation I
    end
    Select best P transformations from the population of
    parent and offspring transformations
until convergence
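The control flow of Algorithm 1 can be sketched as a generic evolutionary loop. This is a simplified illustration and not the MOG3P implementation: it assumes a single scalar fitness to maximize, binary tournament parent selection, and a fixed generation count in place of a convergence test. The function names and the toy problem are hypothetical.

```python
import random

def evolve(init, fitness, breed, pop_size, generations, rng):
    """Generic (P + P) evolutionary loop mirroring the shape of Algorithm 1.

    Assumptions (not from the paper): scalar fitness, binary tournament
    selection, fixed number of generations, pop_size >= 2.
    """
    # Create and score the initial population.
    scored = [(fitness(ind), ind) for ind in (init(rng) for _ in range(pop_size))]
    for _ in range(generations):
        # Select parents: the fitter of two randomly drawn individuals.
        parents = [max(rng.sample(scored, 2), key=lambda s: s[0])[1]
                   for _ in range(pop_size)]
        # Breed one offspring from each pair of neighbouring parents.
        offspring = [breed(parents[i], parents[(i + 1) % pop_size], rng)
                     for i in range(pop_size)]
        # Score offspring, then keep the best P of parents plus offspring.
        scored += [(fitness(child), child) for child in offspring]
        scored.sort(key=lambda s: s[0], reverse=True)
        scored = scored[:pop_size]
    return scored

# Toy problem: individuals are floats and fitness peaks at x = 3.
def toy_init(rng):
    return rng.uniform(-10.0, 10.0)

def toy_fitness(x):
    return -(x - 3.0) ** 2

def toy_breed(a, b, rng):
    # Arithmetic crossover followed by a small Gaussian mutation.
    return (a + b) / 2.0 + rng.gauss(0.0, 0.5)

start = evolve(toy_init, toy_fitness, toy_breed, 20, 0, random.Random(0))
final = evolve(toy_init, toy_fitness, toy_breed, 20, 30, random.Random(0))
```

Because survivors are chosen from parents and offspring together, the best individual is never lost between generations; in MOG3P the same survivor step would compare vectors of objectives rather than a single number.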

3. Experiments 

In this paper, we report results on two benchmark datasets. 



| Name                              | #features      | #samples | #classes                    |
|-----------------------------------|----------------|----------|-----------------------------|
| Wisconsin Breast Cancer (WBC) [2] | 9 (ID removed) | 683      | benign: 444, malignant: 239 |
| Crabs [3]                         | 5              | 200      | 4 (50 each)                 |

Table 1: Datasets

Figure 1: MOG3P diagram

The algorithm is named Multi-Objective Genetic Programming Projection
Pursuit (MOG3P) since it searches for interesting low dimensional
projections of the dataset.

We first apply three widely used dimensionality reduction techniques to
each dataset: principal components analysis (PCA), multidimensional
scaling (MDS) and multiple discriminants analysis (MDA). Then we report
the 10-fold cross-validation performance of each classifier on these
lower dimensional (2D here) representations of the data as well as on
the original dataset. PCA and MDA construct new features based on
linear transformations of the original features; therefore they cannot
uncover non-linear relationships. MDS does not construct an explicit
mapping between the constructed and original features.
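A 10-fold split of the kind used for these evaluations can be sketched as below. Stratifying by class is an assumption at this point (the text states it explicitly only for the MOG3P fitness evaluation), and the function name and round-robin assignment are illustrative choices rather than WEKA's procedure. The class counts are those of the WBC dataset in Table 1.

```python
import random

def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds with near-equal class proportions."""
    rng = random.Random(seed)
    # Group sample indices by class label.
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    folds = [[] for _ in range(k)]
    offset = 0
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Deal the shuffled indices round-robin, continuing where the
        # previous class stopped so fold sizes stay balanced overall.
        for j, i in enumerate(idxs):
            folds[(offset + j) % k].append(i)
        offset += len(idxs)
    return folds

labels = ["benign"] * 444 + ["malignant"] * 239
folds = stratified_folds(labels, 10)

# Each fold serves once as the test set; the rest form the training set.
splits = []
for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    splits.append((train_idx, test_idx))
```

Averaging a classifier's accuracy over the ten test folds gives the cross-validation figure reported per representation.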

MOG3P is an adaptive algorithm that aims to find the optimal feature
representation for the given data. Each candidate data representation
is evaluated in a hybrid wrapper/filter manner. Instead of just one
classifier, we use multiple classifiers. We experiment with the WEKA
[1] implementations of the following classifiers: Naive Bayes,
Logistic, SMO (support vector machine), RBF Network, IBk (k-Nearest
Neighbors), Simple Cart and J48 (decision tree). We examine three ways
to compute the classifiability criterion of each individual: 1)
maximum, 2) minimum and 3) mean accuracy (10-fold stratified
cross-validation accuracy) achieved across the classifiers. For visual
interpretability, we utilize a measure called the linear discriminant
analysis (LDA) index, which is the ratio of the between-group sum of
squares to the within-group sum of squares. For semantic
interpretability, we consider the total size of the data transformation
expressions as the criterion to minimize.
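Two of these criteria can be sketched directly from their descriptions. The code below is one possible reading, not the authors' implementation: the LDA index is taken as a scalar (trace-based) ratio of between-group to within-group sums of squared distances, and the function names `lda_index` and `classifiability` are hypothetical.

```python
def lda_index(points, labels):
    """Ratio of between-group to within-group sum of squares.

    Assumed scalar form: squared Euclidean distances summed over all
    coordinates of the projected points.
    """
    dims = len(points[0])

    def mean(rows):
        return [sum(r[d] for r in rows) / len(rows) for d in range(dims)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    grand = mean(points)
    groups = {}
    for p, y in zip(points, labels):
        groups.setdefault(y, []).append(p)

    between = 0.0
    within = 0.0
    for rows in groups.values():
        centre = mean(rows)
        between += len(rows) * sq_dist(centre, grand)
        within += sum(sq_dist(r, centre) for r in rows)
    # Perfectly tight groups have zero within-group scatter.
    return between / within if within > 0 else float("inf")

def classifiability(accuracies, mode):
    """Aggregate per-classifier accuracies by 'max', 'min' or 'mean'."""
    aggregators = {
        "max": max,
        "min": min,
        "mean": lambda values: sum(values) / len(values),
    }
    return aggregators[mode](accuracies)

# Two well separated groups in a 2D projection.
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
labels = ["a", "a", "b", "b"]
score = lda_index(points, labels)
```

For the four points above, the between-group sum of squares is 100 and the within-group sum is 4, so the ratio is 25.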

The performance of the MOG3P algorithm is evaluated using a nested
10-fold cross-validation scheme in order to assess generalization of
the extracted features to unseen data. Table 2 shows the MOG3P
settings. Fitness comparisons are made using a Pareto dominance based
multi-objective optimization method named SPEA2 [6].
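The Pareto dominance relation underlying such comparisons can be sketched in a few lines. This shows only the dominance test and the extraction of a non-dominated set; SPEA2 itself adds strength-based fitness assignment, density estimation and an archive, none of which is reproduced here. All objectives are assumed to be expressed so that smaller is better, and the example values are hypothetical.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better

def non_dominated(models):
    """Keep the models that no other model dominates."""
    return [m for m in models
            if not any(dominates(other, m) for other in models)]

# Hypothetical (error, expression size) pairs for four candidate models.
models = [(0.02, 15), (0.03, 7), (0.03, 9), (0.05, 7)]
front = non_dominated(models)
```

Here `(0.03, 9)` and `(0.05, 7)` are both dominated by `(0.03, 7)`, while `(0.02, 15)` survives because nothing matches its error with a smaller expression.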



| Population Size                | 400                                                |
| Generations                    | 100                                                |
| Multi objective fitness scheme | SPEA2                                              |
| Archive Size                   | 100                                                |
| Basis functions                | {+, -, *, protected /, min, max, power, log}       |
| Classifiability Objective (C)  | aggregated (min, max, mean) classifier accuracy    |
| Visualization Objective (V)    | T=2, LDA index                                     |
| Semantic Objective (S)         | Total tree size                                    |
| Cross Validation               | 10 times 10-fold cross validation (total 100 runs) |

Table 2: MOG3P Settings

The MOG3P algorithm is a multi-objective machine learning 
technique which utilizes a population based stochastic optimiza- 
tion approach. Due to this nature, it returns a large number 
of data models. Each model contains two expressions that con- 
struct two new features from the original ones, an LDA index 
value measuring the visual separation of the class members, as 
well as training set and test set accuracy values indicating the 
impact of these new features on each classifier's performance. A 
good model can be thought of as a model that contains the short- 
est expressions with smallest LDA index value and highest over- 
all training and test set accuracy on the new feature set. Since 
there are multiple optimal models, model selection becomes a 
model mining process. The set of most informative features can 
be discovered by examining the most frequent features in the 
optimal models. Moreover, classifier selection can be performed 
by examining classifier performance across multiple models. 
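The feature-mining step described here amounts to counting how often each original feature appears across the optimal models. The sketch below assumes each model has already been reduced to the collection of feature indices its expressions use; the function name and the example models are hypothetical.

```python
from collections import Counter

def feature_frequencies(models):
    """Count in how many models each original feature appears.

    Each model is given as an iterable of feature indices; a feature
    used several times within one model is counted once for that model.
    """
    counts = Counter()
    for used in models:
        counts.update(set(used))
    # Most frequent features first.
    return counts.most_common()

# Hypothetical feature usage of three non-dominated models.
models = [[0, 1, 5, 5], [0, 5], [0, 2]]
ranking = feature_frequencies(models)
```

The same tallying idea applies to classifier selection: collecting each classifier's accuracy over all models and comparing the averages.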

Figure 2 shows visualizations of the WBC data using the standard
techniques and Table 3 shows classifier performances on these lower
dimensional representations of the dataset.

Figure 2: Visualizations of WBC data (PCA, MDS, MDA)

Only the MDS and MDA algorithms provide a statistically significant
improvement (pairwise t-test) over the original features.



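| Classifier | PCA (2D)    | MDS (2D)    | MDA (2D)    | All features | MOG3P (2D), fitness: minimum | MOG3P (2D), fitness: maximum | MOG3P (2D), fitness: mean |
|------------|-------------|-------------|-------------|--------------|------------------------------|------------------------------|---------------------------|
| N. Bayes   | 96.78       | 97.07       | 96.78       | 96.34        | 98.21(1.39)                  | 98.17(1.48)                  | 98.17(1.48)               |
| Logistic   | 96.63       | 97.07       | 96.93       | 96.78        | 97.92(1.68)                  | 97.98(1.73)                  | 97.94(1.70)               |
| SMO        | 96.78       | 97.07       | 96.63       | 97.07        | 97.95(1.66)                  | 98.04(1.7)                   | 97.95(1.61)               |
| RBF        | 96.34       | 96.63       | 97.07       | 95.75        | 98.33(1.44)                  | 98.40(1.43)                  | 98.38(1.44)               |
| IBk        | 95.32       | 96.49       | 96.49       | 95.75        | 98.48(1.44)                  | 98.58(1.32)                  | 98.61(1.43)               |
| CART       | 96.78       | 97.22       | 97.07       | 95.17        | 98.30(1.54)                  | 98.39(1.52)                  | 98.26(1.51)               |
| J48        | 97.22       | 97.51       | 96.93       | 96.05        | 98.32(1.50)                  | 98.29(1.54)                  | 98.2(1.54)                |
| Avg(std)   | 96.55(0.60) | 97.01(0.35) | 96.84(0.22) | 96.13(0.65)  | 98.22(1.53)                  | 98.26(1.54)                  |                           |

Table 3: Results on WBC data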

For all fitness types, the MOG3P algorithm (Table 3) finds
significantly better data representations (pairwise t-test) that
increase accuracy across all classifiers compared to the three standard
dimensionality reduction techniques and the original features.
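The statistic behind a paired comparison of this kind can be sketched as follows. The text does not specify how the samples were paired or which significance level was used, so the pairing, the function name and the illustrative numbers below are all assumptions; turning the statistic into a p-value additionally requires a Student t table or a statistics library.

```python
import math
import statistics

def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired samples a and b.

    Raises ZeroDivisionError if all paired differences are identical.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1

# Illustrative values only, not results from the paper.
t, dof = paired_t_statistic([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```

A large positive `t` relative to the critical value for `dof` degrees of freedom indicates that the first method's scores are systematically higher than the second's.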

Figure 3(a) shows the trade-off between training set error and
expression size for the models with the lowest overall test set error.
The best models are marked as non-dominated.

(a) Best data models

Figure 3: Results for MOG3P Fitness type:

Figure 3(b) shows the set of original features that were used by the
non-dominated models. These results indicate that only four out of the
original nine features were useful for classification.



Figure 4 shows visualizations of the Crabs data using the standard
techniques and Table 4 shows classifier performances on these lower
dimensional representations of the dataset.

Figure 4: Visualizations of Crabs data (PCA, MDS, MDA)

Only MDA generates a lower dimensional representation that 
provides a statistically significant improvement (pairwise t-test) 
over the original features as well as a visualization that shows 
clear separation between the classes. 
| Classifier | PCA (2D)   | MDS (2D)    | MDA (2D) | All features | MOG3P (2D), fitness: minimum | MOG3P (2D), fitness: maximum | MOG3P (2D), fitness: mean |
|------------|------------|-------------|----------|--------------|------------------------------|------------------------------|---------------------------|
| N. Bayes   | 57.5       | 67          | 93.5     | 38           | 97.8(3.36)                   | 97.85(3.12)                  | 97.8(2.96)                |
| Logistic   | 59.5       | 63          | 94.5     | 96.5         | 98.1(3)                      | 98.3(2.77)                   | 98.1(3.08)                |
| SMO        | 54.5       | 59          | 94.5     | 63.5         | 97.7(3.44)                   | 97.8(3.43)                   | 97.8(3.04)                |
| RBF        | 67         | 69          | 96       | 49           | 97.95(3.11)                  | 97.9(3.35)                   | 97.95(3.11)               |
| IBk        | 57         | 67.5        | 93       | 89.5         | 97.6(3.3)                    | 97.6(3.51)                   | 97.45(3.52)               |
| CART       | 57.5       | 61          | 94       | 75.5         | 97.8(3.5)                    | 97.35(3.99)                  | 97.8(3.36)                |
| J48        | 56.5       | 59          | 92.5     | 73.5         | 97.85(3.5)                   | 97.65(3.59)                  | 97.55(3.59)               |
| Avg(std)   | 58.5(4.03) | 63.64(4.19) | 94(1.16) | 69.36(20.93) | 97.83(3.31)                  | 97.78(3.41)                  | 97.78(3.24)               |

Table 4: Results on Crabs data

For all fitness types, the MOG3P algorithm (Table 4) finds
significantly better data representations (pairwise t-test) that
increase accuracy across all classifiers compared to the three standard
dimensionality reduction techniques and the original features.

Figure 5(a) shows the trade-off between training set error and
expression size for the models with the lowest overall test set error.
The best models are marked as non-dominated. The results indicate that
all of the original features were necessary for classification.

Figure 5: Results for MOG3P Fitness type: minimum

4. Conclusion 

We outline an exploratory approach to data modeling that seeks to
simultaneously optimize human interpretability and discriminative
power. Different measures of interpretability and discriminative power
can easily be incorporated into the algorithm in a multi-objective
manner without forcing the user to make a priori decisions on the
relative importance of these measures. The MOG3P algorithm is a data
model mining tool that provides users with multiple optimal models,
helping them discover the set of most informative features or select a
classification algorithm by examining classifier performance across
multiple models. Model selection can be performed either by choosing
one best model or an ensemble of good models.